As per Monash’s integrity rules this assignment needs to be completed independently and not shared beyond this class.
🔑 Instructions
This is an open book assignment, and you are allowed to use any resources that you find helpful. However, every resource used needs to be appropriately cited. You can add the citations to the end of the report, the particular style is not important. Lack of citation of resources will incur up to 50% reduction in final score.
You are encouraged to use Generative AI, so that you become accustomed to where it is helpful and where it is problematic on topics related to visualisation for machine learning. You are expected to include a link to the full script(s) of your conversation at the end of your report.
This is an exercise in conducting reproducible analyses/research. You need to turn in multiple files, compiled into one zip file, to be emailed to dicook@monash.edu:
quarto (.qmd)
html
additional supporting files such as assignment.css, setup.R, and any data files. It is expected that rendering the qmd will produce the html file submitted. If the qmd file does not render, the score for the assignment will be reduced by 25%. (If your final file size is too large for email, we can use a shared drive.)
R code should be folded so that it can be examined if interested but otherwise is hidden in the final report.
You will use this .qmd file for writing your answers in the unilur blocks. You will need to install the unilur extension to get the formatting of your solutions done correctly. Follow the instructions at https://github.com/ginolhac/unilur, which say to type quarto add ginolhac/unilur in the Terminal window of your RStudio GUI.
This assignment is worth 40\(\times \frac{1}{3}\)% of your overall score for the unit.
DUE DATE: 4:30pm Friday May 30, 2025
Exercises
1. Chapter 1, question 3, data sets c1 and c3
Solution
cat("\014")
rm(list = ls())
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1245581 66.6 2354692 125.8 2354692 125.8
Vcells 3265472 25.0 8388608 64.0 5533732 42.3
rm(list = ls())  # Removes all variables from memory
gc()             # Frees up unused memory
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1246034 66.6 2354692 125.8 2354692 125.8
Vcells 3266452 25.0 8388608 64.0 5533732 42.3
# Load necessary libraries and data
library(mulgar)
library(tourr)
data("c1")
data("c3")
X_c1 <- scale(c1[, 1:6])
X_c3 <- scale(c3[, 1:10])

# Analysis for c1
## Run a grand tour animation for c1
cat("Running grand tour for c1...\n")
Running grand tour for c1...
animate_xy(X_c1, grand_tour())
animate_xy(X_c3, grand_tour())
Grand Tour Analysis of Dataset c1
Initial Observation: One of the most striking features in the early stages of the tour animation was the presence of three dense regions where points consistently concentrated. These zones appeared clearly separated in multiple projections, providing early evidence of cluster structure in the data. The persistence and stability of these concentrated regions across many frames suggest that the clustering is not an artifact of specific projections, but rather a robust, intrinsic property of the dataset in high-dimensional space. This initial visual cue served as the foundation for deeper insights into the geometric organization of the data, particularly the discovery that each of these clusters adheres closely to a distinct one-dimensional structure.
Clusters: Multiple projections revealed the presence of at least two to three dense groups of points. Interestingly, within each projection, these groups often appeared as narrow, linear formations, suggesting that each cluster lies approximately along a distinct one-dimensional subspace. This pattern indicates that the data do not spread arbitrarily in the full 6D space, but rather are organized along several dominant directions.
Implication of 1D Clusters: The presence of clusters aligned with separate lines implies that, although the data live in a 6-dimensional space, their effective dimensionality is much lower. Each group can be well approximated by a single direction (i.e., a 1D subspace), and the overall dataset likely spans a union of two or three such directions. While each cluster observed in the grand tour appears to lie approximately along a one-dimensional subspace—suggesting strong internal correlation—the presence of multiple such clusters aligned along different directions implies that the data as a whole cannot be reduced to a single dimension without significant loss of structural information. If the cluster directions are not coincident or parallel, projecting all data onto a single axis may collapse distinct clusters into overlapping regions, obscuring their separation and distorting their internal geometry.
Outliers: In some projections, individual points were positioned far from the central mass of their respective clusters, suggesting the presence of outliers. These points may indicate deviations from the otherwise clean linear structure of the clusters.
Dimensionality Patterns: The variation in point dispersion across projections reinforced the idea that the data lie on a low-dimensional, structured manifold embedded in \(\mathbb{R}^6\). The observed geometry suggests a union of multiple 1D submanifolds, each corresponding to a coherent behavioral mode in the data.
Overall, the grand tour revealed that the dataset c1 exhibits strong cluster structure, with each cluster approximately constrained to a linear subspace. This insight supports the use of dimension reduction and structured modeling techniques in downstream analysis, and highlights the power of dynamic visualization in uncovering latent geometric organization in high-dimensional data.
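The low-effective-dimensionality claim above can be checked numerically. The sketch below uses synthetic stand-in data (an assumption, not c1 itself): three 1D clusters along different directions in 6D, plus small noise, so that the first three principal components should carry essentially all the variance.

```r
# Sketch: checking a "union of a few 1D directions" structure with PCA.
# The three direction vectors are illustrative assumptions.
set.seed(1257)
dirs <- list(c(1, 0, 0, 1, 0, 0), c(0, 1, 0, 0, 1, 0), c(0, 0, 1, 0, 0, 1))
X <- do.call(rbind, lapply(dirs, function(d) {
  s <- runif(100, -2, 2)                                 # position along the line
  outer(s, d) + matrix(rnorm(100 * 6, sd = 0.05), ncol = 6)
}))
ev <- prcomp(X)$sdev^2                                   # PCA eigenvalues
round(cumsum(ev) / sum(ev), 3)                           # cumulative variance explained
```

Under this construction the cumulative proportion reaches nearly 1 by the third component; the same diagnostic applied to c1 would quantify the visual impression from the tour.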
Grand Tour Analysis of Dataset c3
Initial Observations: In the initial projections of the grand tour for the c3 dataset, the data exhibited a remarkably clear triangular shape, with points densely distributed along and within a triangular region. Rather than forming distinct, compact clusters, the data appeared to continuously fill out a 2D geometric structure with three visibly extended arms or corners. This suggests that the observations are organized according to underlying constraints or relationships that confine them to a triangular subset of the high-dimensional space.
Potential Basis for Data Generating Process: The triangular structure observed in the 2D projections strongly suggests that the data may be generated from one of two related mechanisms.
First, the data could arise from a uniform distribution over a 2D simplex, i.e., a triangle defined by three fixed vertices \(A, B, C \in \mathbb{R}^2\). In this case, each observation is formed as a convex combination: \[
x = \alpha A + \beta B + \gamma C, \quad \alpha + \beta + \gamma = 1, \quad \alpha, \beta, \gamma \ge 0.
\] The weight vector \((\alpha, \beta, \gamma)\) lies on the 2-dimensional simplex in \(\mathbb{R}^3\), and when the data are projected into 2D they naturally form a triangle.
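This first mechanism (uniform sampling over a triangle) can be sketched in base R. The vertices and the Dirichlet(1,1,1) weights below are illustrative assumptions, not the actual c3 generator.

```r
# Sketch: uniform sample on a 2-simplex via normalised exponentials,
# which is equivalent to Dirichlet(1, 1, 1) barycentric weights.
set.seed(1)
n <- 500
g <- matrix(rexp(3 * n), ncol = 3)     # i.i.d. exponentials
w <- g / rowSums(g)                    # weights: w >= 0, rows sum to 1
A <- c(0, 0); B <- c(1, 0); C <- c(0.5, sqrt(3) / 2)   # assumed vertices
pts <- w %*% rbind(A, B, C)            # convex combinations fill the triangle
plot(pts, pch = 19, cex = 0.4, asp = 1,
     main = "Uniform sample on a 2-simplex")
```

If c3 were generated this way, 2D projections aligned with the simplex would show exactly the dense triangular region described above.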
Rotational Dynamics: As the tour progresses and the viewing direction rotates, the initially crisp triangular shape becomes less apparent in some projections—sometimes collapsing into more circular or amorphous patterns. However, this change reflects the effect of projection rather than a loss of structure. The persistent re-emergence of the triangular geometry in multiple views reinforces the interpretation that the data live on a low-dimensional, non-linear surface, with the triangle acting as a projection of that surface into two dimensions.
Dimensionality and Manifold Structure: The triangular configuration strongly suggests that the data lie on a two-dimensional manifold embedded in \(\mathbb{R}^{10}\). The existence of sharp corners and directional spread indicates that variation is not isotropic, but constrained along a few dominant axes. This implies that dimensionality reduction methods—particularly those that preserve local structure (e.g., PCA, Isomap, UMAP)—could recover this triangular form and clarify the role of the original variables in shaping it.
Implications: The presence of this geometric form points to coordinated relationships among variables, possibly arising from mixtures, bounded proportions, or compositional data effects. While no discrete clusters are evident, the structure is highly informative and could reflect latent regimes or generative processes. Identifying the variable combinations that produce the triangle’s edges or corners could yield insights into which dimensions drive the primary variation, aiding interpretation and guiding downstream modeling choices.
# ─────────────────────────────────────────────────────────────
# Exercise 4a: USArrests – grand-tour exploration
# • animate the tour on the raw variables
# • repeat on z-scored variables (rescale = TRUE)
# • quick Mahalanobis check for possible outliers
# ─────────────────────────────────────────────────────────────
# 0. packages
library(tourr)     # animate_xy(), grand_tour()
library(datasets)  # USArrests data set

# 1. load data
data("USArrests")  # 50 × 4 numeric data

# 2. grand tour – original scale
animate_xy(data = USArrests,
           tour_path = grand_tour(),  # random grand tour
           fps = 60,                  # smooth animation
           axes = "bottomleft",
           rescale = FALSE,           # keep original scale
           title = "Grand tour — USArrests (original scale)")
# 3. grand tour – standardised variables
animate_xy(data = USArrests,
           tour_path = grand_tour(),
           fps = 60,
           axes = "bottomleft",
           rescale = TRUE,            # z-score each variable
           title = "Grand tour — USArrests (standardised)")
# 4. quick outlier flag in the standardised space
ua_scaled <- scale(USArrests)
d2 <- mahalanobis(ua_scaled,
                  center = colMeans(ua_scaled),
                  cov = cov(ua_scaled))
cutoff <- qchisq(0.975, df = ncol(ua_scaled))  # 97.5% chi-squared cutoff
possible_outliers <- names(d2)[d2 > cutoff]
possible_outliers
[1] "Alaska" "North Carolina"
🌀 Grand Tour Analysis — USArrests (Exercise 4a)
We used a grand tour to visualize the USArrests dataset in both its original and standardized forms. The dataset includes 50 U.S. states described by 4 numeric variables: Murder, Assault, UrbanPop, and Rape.
🔹 Tour on Original Scale
The animation shows a long needle-like shape, indicating that the variance is heavily dominated by one or two variables — likely Assault and Murder.
Most points are tightly clustered with little visible spread except along a primary axis.
This makes it hard to detect finer structure or outliers, as scale differences distort the geometry.
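The scale problem described above is easy to verify in base R: on the raw scale, Assault's variance dwarfs the other variables, so it dominates every projection; after z-scoring, all four variables contribute equally.

```r
# Compare per-variable variances before and after standardisation.
data("USArrests")
round(apply(USArrests, 2, var), 1)          # Assault dominates on the raw scale
round(apply(scale(USArrests), 2, var), 1)   # every variable has variance 1
```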
🔹 Tour on Standardized Scale
After applying standardization (rescale = TRUE), all four variables contribute equally to the geometry.
The point cloud spreads more evenly across directions, revealing clearer multivariate structure.
No tight clusters are observed, but the shape becomes more disc-like or elongated, with visible directional variation.
🔎 Outlier Detection (Mahalanobis Distance)
Using Mahalanobis distance on the standardized data, two states stand out:
possible_outliers # [1] "Alaska" "North Carolina"
3. Chapter 2, questions 1-4, using set.seed(1257).
Solution
Question 1.
#| code-fold: true
set.seed(1257)
# Set dimensions
n_rows <- 5
n_cols <- 4
# Generate the matrix with standard normal entries
mat <- matrix(rnorm(n_rows * n_cols), nrow = n_rows, ncol = n_cols)
# Print the matrix (optional)
print(mat)
# Make them orthonormal and check
mat <- tourr::orthonormalise(mat)
tourr::is_orthonormal(mat[, 1])
[1] TRUE
tourr::is_orthonormal(mat[,2])
[1] TRUE
# To determine the contribution we will look at the absolute value.
abs_mat <- abs(mat)
# This will give us the largest contribution of each row on the vectors.
apply(abs_mat, 2, which.max)
[1] 3 5 4 1
# Identify which rows (dimensions) contribute most to each column (basis vector) - vertical
vertical_contrib <- apply(abs_mat, 2, which.max)
# Identify which columns (basis vectors) each row contributes most to - horizontal
horizontal_contrib <- apply(abs_mat, 1, which.max)
# Print results
print("Vertical contributions (which row contributes most to each column):")
[1] "Vertical contributions (which row contributes most to each column):"
print(vertical_contrib)
[1] 3 5 4 1
print("Horizontal contributions (which column each row contributes most to):")
[1] "Horizontal contributions (which column each row contributes most to):"
print(horizontal_contrib)
[1] 4 3 1 3 2
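As a base-R cross-check of the orthonormalisation step above: a matrix with orthonormal columns satisfies \(B^\top B = I\), the same property tourr::is_orthonormal tests. The sketch below obtains orthonormal columns via QR (an equivalent route, not the tourr internals).

```r
# Orthonormalise a random 5 x 4 matrix with QR and verify t(B) %*% B = I.
set.seed(1257)
M <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)
B <- qr.Q(qr(M))            # columns are orthonormal
round(crossprod(B), 10)     # should be the 4 x 4 identity
```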
Question 3.
#| code-fold: true
data("clusters")
head(clusters)
x1 x2 x3 x4 x5 cl
1 0.2436221 -0.9835212 -0.2995361 -0.2660927 0.95810597 A
2 0.4195025 -1.1348318 -0.1322796 -1.3976228 -0.04503262 A
3 -1.2490844 -0.9748305 -0.1725720 -1.5624140 1.36884226 A
4 0.8093173 -1.2599113 -0.3848157 -1.3266300 -0.45883551 A
5 -0.3982195 -0.7862155 0.2287421 -1.5929849 1.08340577 A
6 0.1091388 -1.2605015 -0.3409230 -0.9807830 0.26497517 A
clusters <- as.matrix(clusters[, 1:5])
twodproject <- t(t(mat) %*% t(clusters))
# Make the projection a data frame for plotting
twod_df <- as.data.frame(twodproject)
colnames(twod_df) <- c("Comp1", "Comp2")
# Scatterplot of the projection
plot(twod_df$Comp1, twod_df$Comp2,
     xlab = "Component 1", ylab = "Component 2",
     main = "2D Projection of Clusters Data",
     pch = 19, col = "blue")
Based on this projection, there is not enough evidence for the presence of clustering.
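A small note on the projection code: the double transpose t(t(B) %*% t(X)) is algebraically just X %*% B, the n × d projection of the data onto the basis columns. A self-contained check with synthetic stand-ins (not the clusters data itself):

```r
# Verify that t(t(B) %*% t(X)) equals the plain matrix product X %*% B.
set.seed(1)
X <- matrix(rnorm(30 * 5), ncol = 5)            # 30 observations in 5-D
B <- qr.Q(qr(matrix(rnorm(5 * 2), ncol = 2)))   # orthonormal 5 x 2 basis
stopifnot(all.equal(t(t(B) %*% t(X)), X %*% B))
```

Writing the projection as X %*% B directly is both clearer and cheaper.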
Question 4
#| code-fold: true
# Confirm you’re using only 5 variables
# Load and inspect the data
data("clusters")
head(clusters)
x1 x2 x3 x4 x5 cl
1 0.2436221 -0.9835212 -0.2995361 -0.2660927 0.95810597 A
2 0.4195025 -1.1348318 -0.1322796 -1.3976228 -0.04503262 A
3 -1.2490844 -0.9748305 -0.1725720 -1.5624140 1.36884226 A
4 0.8093173 -1.2599113 -0.3848157 -1.3266300 -0.45883551 A
5 -0.3982195 -0.7862155 0.2287421 -1.5929849 1.08340577 A
6 0.1091388 -1.2605015 -0.3409230 -0.9807830 0.26497517 A
# Confirm you’re using only 5 variables
X <- scale(clusters[, 1:5])  # Use only the first 5 numeric columns
dim(X)  # should return [1] n 5
[1] 300 5
tour_path <- save_history(X, grand_tour(d = 2), max_bases = 3)
tour_path <- as.array(tour_path)
second_basis <- matrix(tour_path[,,2], nrow = 5, ncol = 2)
twodproject <- t(t(second_basis) %*% t(X))
# Make the projection a data frame for plotting
twod_df <- as.data.frame(twodproject)
colnames(twod_df) <- c("Comp1", "Comp2")
# Scatterplot of the projection
plot(twod_df$Comp1, twod_df$Comp2,
     xlab = "Component 1", ylab = "Component 2",
     main = "2D Projection of Clusters Data",
     pch = 19, col = "blue")
Cluster projection summary:
The projection of the clusters dataset onto the 2D plane defined by the second tour basis shows partial separation of groups. Some clusters appear more compact than others, but overlaps remain. There is moderate visual evidence of underlying group structure in the projected space.
Second basis from grand tour:
         [,1]     [,2]
[1,] -0.17379  0.54047
[2,]  0.22696  0.10462
[3,] -0.52068  0.65291
[4,] -0.55584 -0.42897
[5,] -0.57763 -0.27972
We have already analyzed this dataset in depth in Question 1. In addition, I am including correlation plots and densities here so that we get a better idea of the relationships.
First, we load the data
cat("\014")
rm(list = ls())
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 1283791 68.6 2354692 125.8 2354692 125.8
Vcells 3355651 25.7 8388608 64.0 5616802 42.9
# Load libraries
library(tourr)
library(mulgar)
library(GGally)
# Load and scale data
data("c1")
data("c3")
X_c1 <- scale(c1[, 1:6])
X_c3 <- scale(c3[, 1:10])
Analysis for c1
ggpairs(as.data.frame(X_c1),
        title = "Scatterplot Matrix with Correlations - c1")
The scatterplot matrix and grand tour animation both indicate that the dataset does not occupy the full 6-dimensional ambient space.
Several variable pairs show strong linear relationships: \[
\text{Corr}(x_3, x_4) = 0.767, \quad \text{Corr}(x_1, x_2) = -0.247, \quad \text{Corr}(x_2, x_3) = 0.331.
\] These patterns suggest that the data matrix \(X \in \mathbb{R}^{n \times 6}\) likely lies near a lower-dimensional linear subspace: \[
\dim(\mathrm{span}(X)) < 6.
\]
The grand tour reveals three distinct clusters, each forming narrow, nearly linear streaks in different projections. This suggests that each cluster lies along a separate one-dimensional subspace. The global structure, therefore, is a union of multiple 1D submanifolds embedded in a lower-dimensional affine subspace of \(\mathbb{R}^6\), potentially of dimension 2 or 3. If \(\lambda_1, \dots, \lambda_6\) are the eigenvalues of the sample covariance matrix \(\Sigma_X\), and \[ \sum_{j=1}^{3} \lambda_j \gg \sum_{j=4}^{6} \lambda_j, \] then most of the variability is captured by the first three components, supporting the low-dimensional structure hypothesis.
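This eigenvalue condition is directly computable. The sketch below uses synthetic data that is, by construction, close to a 3D subspace of 6D space (an assumption standing in for c1, which would be substituted in practice):

```r
# Compare leading vs trailing eigenvalues of the sample covariance matrix
# for data concentrated near a 3-D subspace.
set.seed(1)
Z <- matrix(rnorm(500 * 3), ncol = 3)                    # 3 latent directions
A <- matrix(rnorm(3 * 6), ncol = 6)                      # embedding into 6-D
X <- Z %*% A + matrix(rnorm(500 * 6, sd = 0.05), ncol = 6)
lambda <- eigen(cov(X), symmetric = TRUE)$values
sum(lambda[1:3]) / sum(lambda)                           # near 1 under this construction
```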
It seems the dataset \(\texttt{c1}\) exhibits clear signs of intrinsic low dimensionality, and its geometric structure is consistent with a union of a few dominant linear directions. Dimension reduction techniques such as PCA or projection pursuit would be effective for uncovering and analyzing this structure.
ggplot(c1, aes(x = x2, y = x1)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed") +
  scale_x_continuous(breaks = unique(c1$x2)) +
  theme_minimal(base_size = 14) +
  labs(title = "Scatterplot of x₁ vs x₂",
       x = expression(x[2]),
       y = expression(x[1]))
On the other hand, the correlation structure between variables reveals some interesting patterns. For example, in the scatterplot between \(x_1\) and \(x_2\), there is an average correlation of \(r \approx -0.247^{***}\). The variable \(x_2\) is discrete, taking values in \(\{-1,0,1,2,3,\dots\}\), which produces vertical columns of points. Within each level of \(x_2\), the variability of \(x_1\) is substantial and even exhibits bimodal structures. Although there is a general decreasing trend of \(x_1\) as \(x_2\) increases, the point cloud does not form a straight band, suggesting the presence of subpopulation heterogeneity and indicating that \(x_2\) should be modeled as a factor or that explicit clustering effects be considered.
At the same time, this piecewise analysis reveals some potential evidence against low-dimensional structures: the moderate linear association \(r \approx -0.247\) implies only \(r^2 \approx 0.061\) of the variance in \(x_1\) is captured by \(x_2\), which is insufficient for a faithful one-dimensional representation. Moreover, since \(x_2\) is discrete on \(\{-1,0,1,2,3,\dots\}\), the joint support forms vertical strips rather than a smooth manifold, violating continuity assumptions of many nonlinear embeddings. The substantial, and at times bimodal, spread of \(x_1\) within each level of \(x_2\) further indicates intra-level heterogeneity that would be lost under a univariate projection. Consequently, a true lower-dimensional manifold is not evident, and one should instead consider modeling \(x_2\) as a factor or employing clustering on \((x_1, x_2)\) rather than collapsing them into a single continuous dimension.
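The variance-explained figure quoted above is a one-line computation:

```r
# Arithmetic check: a correlation of -0.247 explains about 6% of the variance.
r <- -0.247
round(r^2, 3)   # 0.061
```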
The scatterplot matrix shows one notably strong linear association between \(x_3\) and \(x_4\) (\(r = 0.767\)) and a moderate association between \(x_1\) and \(x_3\), but most other pairwise correlations are weak (\(|r| < 0.3\)). Additionally, \(x_2\) is discrete with clustered support, and several variables exhibit heterogeneous or bimodal spreads within levels. Thus, while a single principal component could capture the dominant \(x_3\)–\(x_4\) subspace, a strictly linear reduction to one or two dimensions would discard important variance elsewhere. A more effective strategy is to model the \(x_3\)–\(x_4\) subspace separately, treat discrete variables categorically, or apply clustering to preserve the data’s intrinsic structure.
Analysis for c3
ggpairs(as.data.frame(X_c3),
        title = "Scatterplot Matrix with Correlations - c3")
The majority of linear correlations are close to zero, which reinforces the idea that the structure is nonlinear and bounded. This is consistent with data that satisfy compositional or convex constraints.
The scatterplot matrix reveals very weak linear associations among most variable pairs, with nearly all correlation coefficients falling within the range \([-0.05, 0.05]\). This lack of linear dependence suggests that the data does not exhibit a dominant low-dimensional linear subspace. However, the shape of the scatterplots—many of which show sharp edges and triangular boundaries—indicates that the data may be confined to a bounded, convex region of the space.
The repeated appearance of triangular and simplex-like structures across pairwise projections suggests that the data lies near a lower-dimensional manifold, possibly two- or three-dimensional, embedded within the ambient 10-dimensional space. Despite the lack of linear correlations, the data appears to inhabit a geometric constraint that limits the degrees of freedom across variables.
# Load required libraries
library(tourr)
library(mulgar)
library(dplyr)
library(ggplot2)
# 1. Load and standardise the anomaly2 dataset
data(anomaly2)
data <- anomaly2
data_std <- data %>%
  mutate_if(is.numeric, ~ (. - mean(., na.rm = TRUE)) / sd(., na.rm = TRUE))
rm(data, anomaly2)
# 2. Apply PCA
pca_result <- prcomp(data_std, scale. = FALSE)
# 3. Show the scree plot to decide how many PCs to retain
ggscree(pca_result, q = 4) + ggtitle("Scree plot for anomaly12")
# 4. Visualise the first PCs with a tour
# Select the first k PCs based on visual inspection of the scree plot
pc_data <- as.data.frame(pca_result$x[, 1:3])  # adjust to 1:5 if needed
# 5. Tour to see whether the anomaly shows up in the first PCs
animate_xy(as.matrix(pc_data), grand_tour())
Based on the scree plot, we selected the first three principal components (PC1–PC3), which account for most of the variation in the dataset.
We then used a tour to visualize the projection of the data in the reduced 3D PCA space. The point cloud appears uniformly filled, with no visually obvious anomaly or outlier across any of the projections involving PC1–PC3.
This suggests that the anomaly is not visible in the first 3 principal components. Therefore, the anomaly likely lies in the fourth principal component (PC4), which captures a small amount of variance and would typically be discarded under standard PCA-based dimension reduction.
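This phenomenon, an anomaly hiding in a low-variance component, is easy to reproduce. The sketch below uses synthetic data as a stand-in for anomaly2 (an assumption): an outlier planted along a low-variance axis barely affects the leading PCs but dominates the last one.

```r
# An outlier along a low-variance direction is visible only in the last PC.
set.seed(1)
X <- cbind(matrix(rnorm(200 * 3), ncol = 3),   # three high-variance axes
           rnorm(200, sd = 0.1))               # one low-variance axis
X[1, 4] <- 5                                   # the planted anomaly
pc <- prcomp(X)
which.max(abs(pc$x[, 4]))                      # row 1 stands out in PC4
```

Discarding "small" components before looking for anomalies would therefore throw away exactly the direction that carries this signal.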
6. Chapter 5, question 4
Solution
# Load required libraries
library(ggplot2)
library(dplyr)
library(Rtsne)
library(uwot)
library(tourr)
library(detourr)
library(patchwork)

# Use cleaned numeric dataset
apar <- aflw %>%
  select(where(is.numeric)) %>%
  na.omit()
apar <- scale(apar)

# PCA
pca_result <- prcomp(apar, scale. = TRUE)
pca_2d <- as.data.frame(pca_result$x[, 1:2])
colnames(pca_2d) <- c("PC1", "PC2")

# t-SNE
set.seed(42)
tsne_result <- Rtsne(apar, dims = 2, perplexity = 30)
tsne_2d <- as.data.frame(tsne_result$Y)
colnames(tsne_2d) <- c("tSNE1", "tSNE2")

# UMAP
set.seed(42)
umap_result <- umap(apar, n_neighbors = 15)
umap_2d <- as.data.frame(umap_result)
colnames(umap_2d) <- c("UMAP1", "UMAP2")

# PCA plot
p1 <- ggplot(pca_2d, aes(x = PC1, y = PC2)) +
  geom_point(alpha = 0.6) + ggtitle("PCA") + theme_minimal()
# t-SNE plot
p2 <- ggplot(tsne_2d, aes(x = tSNE1, y = tSNE2)) +
  geom_point(alpha = 0.6) + ggtitle("t-SNE") + theme_minimal()
# UMAP plot
p3 <- ggplot(umap_2d, aes(x = UMAP1, y = UMAP2)) +
  geom_point(alpha = 0.6) + ggtitle("UMAP") + theme_minimal()

# Show all three side by side
p1 + p2 + p3
🔁 Updated Analysis After Normalization
After normalizing all numeric variables in the AFLW dataset, we re-ran PCA, t-SNE, and UMAP. This ensures each feature contributes equally, avoiding distortions caused by differences in scale (e.g., metres dominating over binary or rate-based variables). The results yield clearer and more balanced representations.
🧮 Principal Component Analysis (PCA): Interpreting Loadings
The first two principal components after normalization provide meaningful axes of variation:
PC1: largest negative loadings on uncontested and contested possessions (−0.277), turnovers (−0.268), handballs (−0.229), clearances (−0.233)
Interpretation: PC1 captures overall engagement with the ball, particularly volume-based actions. Players with low PC1 scores (left in the PCA plot) are highly active in possession and ball movement.
Interpretation: PC2 separates attacking threats (high scores) from defensive interceptors (low scores). Players at the top of the PCA plot are goal-oriented, while those at the bottom contribute more to defense.
🧭 Interpreting t-SNE: Correlation with Original Variables
t-SNE doesn’t yield loadings, but variable correlations with its 2D embedding allow interpretation:
Interpretation: tSNE2 encodes role orientation. Lower values indicate forwards and scorers, while higher values reflect defensive rebounders and interceptors.
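The correlation-based interpretation described above can be sketched directly. Since t-SNE returns only coordinates, each embedding axis is correlated against the original (scaled) variables; the data below are synthetic stand-ins for apar and tsne_2d, and the variable names are taken from the report for illustration.

```r
# Correlate each embedding axis with the original variables to interpret it.
set.seed(1)
vars <- matrix(rnorm(100 * 4), ncol = 4,
               dimnames = list(NULL, c("disposals", "kicks", "shots", "intercepts")))
emb <- cbind(tSNE1 = vars[, "disposals"]  + rnorm(100, sd = 0.3),
             tSNE2 = -vars[, "intercepts"] + rnorm(100, sd = 0.3))
round(cor(vars, emb), 2)   # rows: variables, columns: embedding axes
```

Large absolute correlations flag which original variables an embedding axis tracks, which is exactly the reasoning used for tSNE1 and tSNE2 here.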
🌐 Interpreting UMAP: Correlation with Original Variables
Like t-SNE, UMAP is nonlinear and lacks loadings, but correlations with original features reveal latent structure:
Interpretation: UMAP2 separates defensive contributors and midfielders (high values) from more static or front-line players.
📊 Summary Table
| Axis  | Main Interpretation                      | Strong Contributors                          |
|-------|------------------------------------------|----------------------------------------------|
| PC1   | General involvement and ball movement    | Disposals, kicks, metres, possessions        |
| PC2   | Offensive vs. defensive specialization   | Shots, goals, accuracy vs. intercepts        |
| tSNE1 | Volume and intensity of play             | Disposals, kicks, possessions, metres        |
| tSNE2 | Attacking vs. defensive role contrast    | Shots, accuracy, goals vs. intercepts        |
| UMAP1 | Offensive orientation / scoring threat   | Shots, marks_in50, accuracy, goals           |
| UMAP2 | Midfield & defensive involvement         | Intercepts, rebounds, uncontested, disposal  |
🧠 Key Insight
After normalization, the relationships between players and their attributes are more balanced and interpretable:
PCA reveals two major, interpretable axes: involvement and role type.
t-SNE exposes fine-grained differences in activity and specialization, clustering similar players.
UMAP provides the clearest segmentation into latent roles, such as attackers, midfielders, and defenders.
Together, they offer a consistent and nuanced understanding of player performance archetypes in the AFLW dataset.
✨ Comparing PCA, t-SNE, and UMAP Representations of AFLW Data
This analysis compares three different 2D representations of the AFLW dataset: PCA, t-SNE, and UMAP. All are based on normalized numeric variables to ensure fair contribution from all features. Our goal is to interpret what each method reveals about the structure of the data and player types, and how those insights align with the dynamic projections offered by a tour (via liminal or detourr).
🧮 Principal Component Analysis (PCA)
The PCA projection (left panel in the plot) shows a diffuse elliptical cloud, with no strong clustering but a clear directional spread. Based on loadings:
PC1 captures general involvement and ball activity:
High negative loadings on disposals, possessions, kicks, metres, and turnovers
PC2 reflects offensive vs. defensive specialization:
Positive loadings on shots, goals, accuracy, and assists
Negative loadings on intercepts and rebounds_in50
The PCA plot indicates a continuous spectrum of player roles, from highly active ball users to more peripheral or specialized roles. There’s no clear cluster separation, but players vary along interpretable gradients.
Axis Relationship: PC1 and PC2 are orthogonal by design. This leads to a clean decomposition of variance, with little correlation between the axes—confirming that involvement and role specialization are independent in the data.
🔀 t-SNE Representation
The t-SNE embedding (middle panel) reveals a dense central mass with slight curvature, forming a rough horseshoe or hook shape. It preserves local neighborhoods, helping us see finer distinctions between similar players.
Based on correlations:
tSNE1 aligns strongly with volume-based activity: disposals, kicks, possessions, metres
Unlike PCA, the t-SNE plot suggests soft local clustering—players with similar styles are close, but not in clearly separated blobs. The structure hints at latent manifolds, especially a nonlinear activity-to-role continuum.
Axis Relationship: The S-shaped curvature suggests nonlinear correlation between tSNE1 and tSNE2, reflecting interaction between activity level and specialization. The axes are not interpretable independently in the same way as PCA.
🌐 UMAP Representation
The UMAP projection (right panel) displays more distinct lobe-like branches, indicating sharper separations between latent groups. This structure aligns with UMAP’s ability to preserve both local and some global structure.
UMAP2 separates defensive contributors (intercepts, rebounds_in50) from more offensive or forward-oriented roles
UMAP reveals clearer player archetypes than t-SNE and PCA. The branches likely represent clusters such as: attacking forwards, rebounding defenders, mobile midfielders, and low-involvement players.
Axis Relationship: UMAP1 and UMAP2 show nonlinear interaction, with several curved branches fanning out in distinct directions. The space isn’t globally orthogonal but captures interacting traits.
🎥 Using the Tour (liminal / detourr)
Both liminal and detourr enable exploration of high-dimensional data projections over time. By linking a tour to the PCA, t-SNE, and UMAP embeddings, we observe:
PCA emphasizes variance, but the tour shows that important structure may lie beyond PC1–PC2
t-SNE and UMAP provide sharper 2D representations, but lack direct interpretability of projection vectors
The tour reveals smooth transitions between player clusters that neither t-SNE nor UMAP can show as trajectories—they provide static embeddings
Thus, combining static plots with interactive tours gives a more complete view: PCA and the tour give interpretability and directionality, while t-SNE and UMAP expose latent nonlinear structures and soft clusters.
🧠 Conclusion
| Method | Key Strength           | What We Learn                          |
|--------|------------------------|----------------------------------------|
| PCA    | Variance decomposition | Axes correspond to activity and role   |
| t-SNE  | Local similarity       | Continuum of player styles             |
| UMAP   | Latent grouping        | Clear archetypes emerge                |
| Tour   | Dynamic exploration    | Explores hidden dimensions beyond 2D   |
Combining all four tools allows us to understand both the geometry and meaning of high-dimensional player data in the AFLW, helping us uncover continuous skill gradients, discrete role types, and the underlying structure of performance diversity.
7. Chapter 6, question 3, data sets c1 and c3
Solution
cat("\014")
rm(list = ls())
gc()
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2633940 140.7 4450727 237.7 4450727 237.7
Vcells 5772631 44.1 12255594 93.6 9510557 72.6
# Load necessary libraries and data
library(mulgar)
library(tourr)
data("c1")
data("c3")
X_c1 <- scale(c1[, 1:6])
X_c3 <- scale(c3[, 1:10])

# Analysis for c1
## Run a grand tour animation for c1
cat("Running grand tour for c1...\n")
Running grand tour for c1...
animate_xy(X_c1, grand_tour())
animate_xy(X_c3, grand_tour())
Exploratory Cluster Analysis c1
Overview
We conduct a visual clustering analysis of the c1 dataset using a Grand Tour animation. The goal is to detect structural patterns in the 6-dimensional standardized data and identify any latent clusters.
Visual Evidence of Linear Clusters
Across the full sequence of provided frames, the observations consistently organize into two elongated clusters, suggesting that the groups are not spherical but rather linearly shaped. These linear shapes persist across several projections, indicating:
A linear relationship among variables within each cluster.
The clusters are not only separable by direction but also follow distinct linear trends, possibly due to internal correlations between pairs of variables.
This structural property suggests that PCA or linear discriminant analysis may be especially suitable for dimensionality reduction and further analysis.
Variables Driving Separation
From inspection of the Grand Tour frames, we observe the following:
X2 and X5 appear frequently aligned with the separation axis. In several projections, one cluster lies along high values of X5 and low X2, while the other occupies the opposite direction.
X3 and X6 also exhibit considerable influence, particularly in projections where one of the clusters lies nearly along a diagonal that mixes these variables.
X1 and X4 appear less critical to cluster separation, as their axes often align orthogonally to the direction of maximum variance or contribute weakly to separating the groups.
These patterns suggest that variables X2, X3, X5, and X6 are the most informative for distinguishing the latent subpopulations in c1.
Summary Interpretation
The data exhibits strong visual evidence of two linearly structured clusters, meaning that observations within each group follow their own linear trends or relationships between features.
The variables X2, X3, X5, and X6 contribute the most to these patterns and likely drive the clustering.
Clustering algorithms assuming Euclidean distance may underperform unless the data is first projected into a lower-dimensional subspace aligned with the directions of separation.
Follow-up analysis could benefit from applying PCA for feature compression, followed by k-means or model-based clustering.
Recommendation
We recommend performing:
PCA to reduce dimensionality and capture linear structure.
k-means with (k = 2) or Gaussian Mixture Models on the projected space.
Scatterplot matrix of (X2, X3, X5, X6) to visually verify the linear margins between clusters.
The Grand Tour analysis has thus proven effective at revealing not only the existence of clusters in c1 but also their geometry and variable dependence.
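The PCA → k-means pipeline recommended above can be sketched in a few lines. Since the `mulgar` package may not be available here, the sketch below simulates two linearly structured clusters in 6-D as a stand-in for `c1`; with `mulgar` installed you would substitute `X <- scale(c1[, 1:6])`.

```r
# Sketch: PCA for feature compression, then k-means with k = 2.
# Simulated data standing in for c1 (two linear clusters in 6-D).
set.seed(42)
n <- 100
t1 <- runif(n); t2 <- runif(n)
grp1 <- cbind(t1, -t1, t1, 0, t1, -t1) + matrix(rnorm(n * 6, sd = 0.1), n)
grp2 <- cbind(t2, t2, -t2, 0, -t2, t2) + matrix(rnorm(n * 6, sd = 0.1), n) + 3
X <- scale(rbind(grp1, grp2))

# 1. PCA to capture the linear structure
pc <- prcomp(X)
scores <- pc$x[, 1:2]  # first two principal components

# 2. k-means with k = 2 on the projected space
km <- kmeans(scores, centers = 2, nstart = 25)
table(km$cluster, rep(1:2, each = n))  # how well the true groups are recovered
```

Because the separation lives along a low-dimensional linear direction, k-means on the first two principal components recovers the groups cleanly, which is the point of projecting first.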
Exploratory Cluster Analysis c3
Introduction
This analysis explores the geometric properties of the high-dimensional dataset c3 using Grand Tour visualization. Unlike c1, where clear clustering was observed, the structure of c3 appears fundamentally different and more geometrically constrained. The goal is to interpret the emergent triangular (pyramidal) shape and uncover what it may reveal about the data-generating process.
Observations from the Grand Tour
Using the dynamic projection of the standardized c3 dataset (10 dimensions), the following patterns consistently emerge:
The data points fill a triangular or pyramid-like region, rather than forming separated groups.
There are no empty gaps, but rather a continuous distribution along the edges and interior of a convex hull.
Most data is concentrated in the center, with gradual spreading toward three dominant directions, giving the impression of three poles or extremes.
These features are not indicative of discrete clusters, but rather of a convex, constrained structure where the observations lie within the convex combination of a few extreme points.
Interpretation: A Constrained Data-Generating Process
The triangular/pyramidal shape likely results from an underlying generative mechanism with constraints, such as Mixture models or convex combinations of latent profiles (e.g., soft membership in archetypes). This structure reflects a system where the variation is not unconstrained, and the observed data live inside a low-dimensional convex space embedded in 10D space.
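The convex-combination mechanism conjectured above is easy to simulate: draw random non-negative weights that sum to one and mix a few latent "archetype" profiles. The archetypes below are arbitrary illustrative vectors, not estimates from `c3`.

```r
# Sketch of the conjectured generating mechanism: each observation is a
# convex combination (soft membership) of three latent archetypes in 10-D.
set.seed(1)
archetypes <- matrix(rnorm(3 * 10), nrow = 3)  # 3 extreme profiles in 10-D

# Dirichlet-style weights: each row is non-negative and sums to 1
w <- matrix(rexp(200 * 3), ncol = 3)
w <- w / rowSums(w)

X <- w %*% archetypes  # every point lies inside the simplex spanned by the archetypes

# The centred data has rank <= 2: all variation lives in a 2-D plane,
# which is why a tour of such data shows a flat triangular sheet.
sv <- svd(scale(X, scale = FALSE))$d
round(sv, 3)
```

The third and later singular values are numerically zero, confirming that convex mixing of three profiles confines the data to a 2-D triangular region, however high the ambient dimension.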
Role of Variables in Defining the Geometry
Inspection of the Grand Tour frames highlights several variables that seem to anchor the triangular form:
X6 consistently points in the direction of one of the triangle’s “edges” or extremes.
X7 and X10 appear to define other important directions, forming the corners of the simplex-like shape.
Variables X1–X5 seem to contribute to variation around the center, not strongly aligning with the outermost structure.
These variables may represent dominant latent traits or factors whose interaction and relative intensity define the space.
Conclusion
The Grand Tour visualization of c3 reveals a constrained, continuous, and convex structure, best interpreted as the product of a triangular or polyhedral data-generating process. The dataset does not exhibit classical clustering, but rather aligns with models of soft membership, mixture compositions, or latent trade-offs. Any further analysis should account for this geometric constraint, avoiding clustering methods that assume spherical or disconnected groups.
To answer the question, let's first run a grand tour over the data.
cat("\014")      # clear console
rm(list = ls())  # remove all objects
gc()             # run garbage collection
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2634108 140.7 4450727 237.7 4450727 237.7
Vcells 5772971 44.1 12255594 93.6 9510557 72.6
# Load required packages and data
library(mulgar)
library(tourr)
library(dplyr)
data(aflw)
aflw <- mulgar::aflw

# Step 1: Subset the dataset, keeping only the numeric performance
# variables (columns 7-35), and (optionally) scale them
aflw_subset <- aflw[, 7:35]
aflw_subset_scaled <- scale(aflw_subset)

# Step 2: Run a grand tour
animate_xy(aflw_subset_scaled, tour_path = grand_tour(), fps = 60, axes = "center")
Priors on Clustering Structure
The grand tour projection of the scaled AFLW data shows a relatively dense central core of points, with a few observations drifting toward the periphery in specific directions. The variables appear to load consistently in a radial pattern from the center, with no strong outliers or multimodal projections across most frames. Based on this structure, our prior is that the dataset does not exhibit clear, well-separated clusters, but rather may contain subtle elongated structures or gradients. We would expect methods like Ward’s linkage or model-based clustering (e.g., Gaussian mixtures with soft boundaries) to be more appropriate than k-means or single linkage, which assume spherical or chain-like structures respectively.
Based on this, let's look at the dendrograms.
Dendogram <- function(actualmethod, labeluse) {
  # Load and store the data
  data(aflw)
  aflw <- mulgar::aflw

  # Step 1: Subset to the numeric performance variables (columns 7-35) and scale
  aflw_subset <- aflw[, 7:35]
  aflw_subset_scaled <- as.data.frame(scale(aflw_subset))
  rownames(aflw_subset_scaled) <- seq_len(nrow(aflw_subset_scaled))  # row names 1, 2, ..., n
  simple_clusters <- aflw_subset_scaled

  # Step 2: Compute hierarchical clustering with the requested linkage
  cl_hw <- hclust(dist(simple_clusters), method = actualmethod)

  # Prepare dendrogram layout for ggplot
  cl_ggd <- dendro_data(cl_hw, type = "triangle")

  # Project the hierarchical structure back onto the original data space
  cl_hfly <- hierfly(simple_clusters, cl_hw, scale = FALSE)

  # Add cluster labels to the dataset
  simple_clusters <- simple_clusters |>
    mutate(clw = factor(cutree(cl_hw, k = 2)))  # change k as needed

  # Plot the dendrogram
  ph <- ggplot() +
    geom_segment(data = cl_ggd$segments,
                 aes(x = x, y = y, xend = xend, yend = yend)) +
    geom_point(data = cl_ggd$labels, aes(x = x, y = y),
               colour = "#3B99B1", alpha = 0.8) +
    ggtitle(labeluse) +
    theme_minimal() +
    theme_dendro()
  return(ph)
}

p1 <- Dendogram("ward.D2", "(a)")
p2 <- Dendogram("single", "(b)")
p3 <- Dendogram("complete", "(c)")
p1 + p2 + p3
📊 Comparative Analysis of Linkage Methods
We compare the dendrograms produced by three hierarchical clustering linkage methods: Ward’s linkage, Single linkage, and Complete linkage. Each method leads to distinct clustering structures, reflecting their underlying merging criteria.
(a) Ward’s Linkage (ward.D2)
Ward’s method produces a balanced and interpretable dendrogram, where merges occur between clusters with minimal increase in total within-cluster variance. The resulting tree is characterized by:
Compact, spherical clusters.
Relatively evenly distributed heights in the tree, suggesting consistent dissimilarities.
High interpretability, where vertical distance (merge height) reflects actual dissimilarity.
This method is ideal when homogeneous and tight clusters are expected. It also tends to be robust to outliers and noise.
✅ Ward is the most stable and effective for general-purpose clustering.
(b) Single Linkage
Single linkage selects the minimum distance between clusters for merging. The resulting dendrogram exhibits the classic chaining effect, where:
Points are added one-by-one to long, extended chains.
The tree becomes unbalanced and ragged, making interpretation difficult.
Clusters may be loosely connected, lacking true cohesion.
While this method can uncover arbitrary-shaped clusters, it is highly sensitive to noise, often merging dissimilar points early due to isolated links.
⚠️ Single linkage is not recommended when seeking compact, well-separated groups.
(c) Complete Linkage
Complete linkage merges clusters based on the maximum pairwise distance between their elements. This yields:
Tight, conservative clusters with small diameters.
A more structured dendrogram than single linkage, avoiding excessive chaining.
Some sensitivity to outliers, which may delay otherwise natural merges.
This method is a reasonable compromise between Ward and Single linkage, producing interpretable clusters while avoiding excessive fragmentation.
🟡 Complete linkage is useful when cluster compactness is crucial, though not as balanced as Ward.
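The contrasting behaviours described in (a)–(c) can be demonstrated on synthetic data: two compact blobs joined by a thin bridge of points. This toy example is not the `aflw` data; it simply makes the chaining-vs-compactness trade-off concrete.

```r
# Toy illustration of linkage behaviour: two Gaussian blobs plus a
# thin bridge of points between them.
set.seed(7)
blob1  <- matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2)  # 30 points near (0, 0)
blob2  <- matrix(rnorm(60, mean = 4, sd = 0.3), ncol = 2)  # 30 points near (4, 4)
bridge <- cbind(seq(0.5, 3.5, length.out = 10),
                seq(0.5, 3.5, length.out = 10))            # chain connecting them
X <- rbind(blob1, blob2, bridge)

d <- dist(X)
cl_single <- cutree(hclust(d, method = "single"),  k = 2)
cl_ward   <- cutree(hclust(d, method = "ward.D2"), k = 2)

# Single linkage can chain through the bridge, so its cut may be
# unbalanced; Ward splits the bridge and keeps the two blobs balanced.
table(cl_single)
table(cl_ward)
```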
# =========================================================
# 0. Load packages
# =========================================================
library(mulgar)
library(tourr)
library(dplyr)
library(RColorBrewer)
library(ggplot2)   # <- was missing
library(ggdendro)  # <- was missing

# =========================================================
# 1. Prepare the data
# =========================================================
data(aflw)
X <- scale(aflw[, 7:35])  # only numeric variables, scaled
Xm <- as.matrix(X)        # tourr prefers a plain matrix

# =========================================================
# 2. Hierarchical clustering (choose your preferred method)
# =========================================================
hc <- hclust(dist(Xm), method = "ward.D2")  # "single" or "complete" to compare
k <- 2                                      # number of clusters to cut
clusters <- factor(cutree(hc, k = k))

# Colour palette: one distinct tone per cluster
pal <- brewer.pal(max(3, k), "Dark2")  # Dark2 has good contrast
col_vec <- pal[clusters]               # final colour vector

# =========================================================
# 3A. Standard tour (grand_tour) coloured by clusters
# =========================================================
animate_xy(
  Xm,
  tour_path = grand_tour(),
  fps = 60,
  axes = "center",
  col = col_vec  # <- this is where the cluster colour is applied
)
Cluster Structure Exploration via Grand Tour
The following animation shows a dynamic 2D projection of the aflw dataset using the Grand Tour technique. Each data point is colored according to the output of a hierarchical clustering algorithm with k = 2 clusters, using Ward’s linkage on the scaled numerical variables of the dataset.
Interpretation
Throughout the animation, we observe two apparent groups distinguished by color (orange and green). However, the separation between these two clusters is neither consistent nor strong across most projections. Instead, we see a large amount of overlap, and the boundary between the groups becomes blurred as the view rotates.
This suggests that the algorithm is partitioning the dataset along relatively weak directions of variance that are only evident in certain projections. There is no persistent low-dimensional structure clearly separating the two groups.
Conclusion
Given the lack of stable and well-separated structure across projections, it is more appropriate to interpret this dataset as a single coherent cluster. The clustering algorithm may be overfitting to noise or minor fluctuations in high-dimensional space, rather than identifying meaningful latent groupings.
Thus, further analysis should proceed under the assumption that aflw contains a single population, unless additional domain knowledge justifies a specific segmentation.
9. Chapter 12, project 1-7
Solution
First load the data and prepare:
# ── 1. Clean workspace and load data ─────────────────────────────
cat("\014")      # clear console
rm(list = ls())  # remove all objects
gc()             # run garbage collection
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2664472 142.3 4450727 237.7 4450727 237.7
Vcells 5962647 45.5 12255594 93.6 9510557 72.6
# Download the risk_MSA dataset directly from GitHub
url <- "https://raw.githubusercontent.com/dicook/mulgar_book/master/data/risk_MSA.rds"
MSA <- as.data.frame(readRDS(url(url)))  # store the full dataset in object MSA

# ── 2. Keep only numeric columns ─────────────────────────────────
library(dplyr)
MSAnumeric <- MSA %>% select(where(is.numeric))  # extract numeric variables only
MSAnumeric <- scale(MSAnumeric)

# ── 3. Quick checks ──────────────────────────────────────────────
str(MSA)            # structure of the full dataset
summary(MSAnumeric) # summary statistics of the scaled numeric data
'data.frame': 563 obs. of 6 variables:
$ Recreational: int 3 1 2 1 5 5 1 5 1 3 ...
$ Health : int 2 1 2 1 4 2 1 5 1 5 ...
$ Career : int 1 1 1 1 1 5 1 5 1 2 ...
$ Financial : int 2 1 1 1 3 3 1 5 1 1 ...
$ Safety : int 2 1 2 1 5 2 1 5 1 4 ...
$ Social : int 4 1 3 1 5 3 1 5 1 5 ...
Recreational Health Career Financial
Min. :-1.1252 Min. :-1.2096 Min. :-0.984611 Min. :-1.08778
1st Qu.:-1.1252 1st Qu.:-0.3432 1st Qu.:-0.984611 1st Qu.:-1.08778
Median :-0.1797 Median :-0.3432 Median :-0.006946 Median :-0.02823
Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.00000
3rd Qu.: 0.7658 3rd Qu.: 0.5232 3rd Qu.:-0.006946 3rd Qu.:-0.02823
Max. : 2.6568 Max. : 2.2560 Max. : 2.926048 Max. : 3.15043
Safety Social
Min. :-1.3216 Min. :-0.99179
1st Qu.:-0.2780 1st Qu.:-0.99179
Median :-0.2780 Median :-0.01731
Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.7655 3rd Qu.: 0.95717
Max. : 2.8527 Max. : 2.90613
Based on the grand tour animation of the MSAnumeric dataset, we can observe the following:
Central Compactness:
The data points are densely concentrated near the origin of the projection space. This indicates a relatively homogeneous dataset with no visible long tails or outlier directions.
Evidence of Discreteness:
The point cloud exhibits a clear “pixelated” or grid-like structure. This suggests that the variables take on a limited number of discrete values. Indeed, each variable in the dataset comes from Likert-style survey responses (typically 1–5). After standardization, these categorical responses map to a small set of z-score levels. When combined in projection, this leads to repeated patterns and stacking of points at specific coordinates.
Lack of Cluster Separation:
Throughout the rotation of the tour, there is no indication of distinct, isolated clusters. The data does not split into separate masses, nor are there visible low-density gaps. The overall geometry forms a single dense sphere, implying no clear subgroup structure.
Conclusion:
The dataset appears to be discrete in nature, with values arising from a fixed set of survey responses. There is no evidence of natural clustering in the numeric representation of the data. Any attempt to cluster this data may require transformation into latent factors or use of non-numeric representations.
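The discreteness claim can be verified directly: standardizing a 5-point Likert variable can only ever produce five distinct z-score levels, which is what creates the grid-like stacking seen in the tour. The responses below are simulated, but the same check can be run on each column of the real `MSA` data.

```r
# Standardizing 5-point Likert responses yields at most 5 distinct z-levels.
set.seed(99)
likert <- sample(1:5, 500, replace = TRUE)  # simulated survey responses
z <- as.numeric(scale(likert))              # standardized values

length(unique(round(z, 10)))  # at most 5 distinct standardized values
```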
2. Hierarchical Clustering with Different Options
######################################
# Load the data
######################################
# ── 1. Clean workspace and load data ─────────────────────────────
cat("\014")      # clear console
rm(list = ls())  # remove all objects
gc()             # run garbage collection
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2675790 143.0 4450727 237.7 4450727 237.7
Vcells 5987713 45.7 12255594 93.6 9510557 72.6
# Download the risk_MSA dataset directly from GitHub
url <- "https://raw.githubusercontent.com/dicook/mulgar_book/master/data/risk_MSA.rds"
MSA <- as.data.frame(readRDS(url(url)))  # store the full dataset in object MSA

# ── 2. Keep only numeric columns ─────────────────────────────────
library(dplyr)
library(cluster)  # silhouette() (optional quality check)
MSAnumeric <- MSA %>% select(where(is.numeric))  # extract numeric variables only
MSAnumeric <- scale(MSAnumeric)

# ── 3. Quick checks ──────────────────────────────────────────────
str(MSA)            # structure of the full dataset
summary(MSAnumeric) # summary statistics of the scaled numeric data
'data.frame': 563 obs. of 6 variables:
$ Recreational: int 3 1 2 1 5 5 1 5 1 3 ...
$ Health : int 2 1 2 1 4 2 1 5 1 5 ...
$ Career : int 1 1 1 1 1 5 1 5 1 2 ...
$ Financial : int 2 1 1 1 3 3 1 5 1 1 ...
$ Safety : int 2 1 2 1 5 2 1 5 1 4 ...
$ Social : int 4 1 3 1 5 3 1 5 1 5 ...
Recreational Health Career Financial
Min. :-1.1252 Min. :-1.2096 Min. :-0.984611 Min. :-1.08778
1st Qu.:-1.1252 1st Qu.:-0.3432 1st Qu.:-0.984611 1st Qu.:-1.08778
Median :-0.1797 Median :-0.3432 Median :-0.006946 Median :-0.02823
Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.00000
3rd Qu.: 0.7658 3rd Qu.: 0.5232 3rd Qu.:-0.006946 3rd Qu.:-0.02823
Max. : 2.6568 Max. : 2.2560 Max. : 2.926048 Max. : 3.15043
Safety Social
Min. :-1.3216 Min. :-0.99179
1st Qu.:-0.2780 1st Qu.:-0.99179
Median :-0.2780 Median :-0.01731
Mean : 0.0000 Mean : 0.00000
3rd Qu.: 0.7655 3rd Qu.: 0.95717
Max. : 2.8527 Max. : 2.90613
######################################
# 1. Clustering
######################################
## Goal: compare 6-cluster solutions obtained with different
##   • distance metrics: "euclidean", "manhattan"
##   • linkage criteria: "single", "complete", "ward.D2"
## NOTE: Ward's linkage is only valid with Euclidean distances.

## --- helper to build a k = 6 solution ------------------------------------
hclust_k6 <- function(x, dist_method, link_method) {
  d <- dist(x, method = dist_method)
  hc <- hclust(d, method = link_method)
  cutree(hc, k = 6)
}

## --- combinations to evaluate --------------------------------------------
distances <- c("euclidean", "manhattan")
linkages  <- c("single", "complete", "ward.D2")

# expand.grid() creates a data frame with all combinations of distances and linkages.
# KEEP.OUT.ATTRS = FALSE avoids attaching unnecessary metadata to the output.
# stringsAsFactors = FALSE keeps string variables as character rather than factor.
valid_combos <- expand.grid(dist = distances, link = linkages,
                            KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)

# keep only valid combos (Ward requires Euclidean)
valid_combos <- subset(valid_combos, !(link == "ward.D2" & dist != "euclidean"))

## --- compute the 6-cluster solutions -------------------------------------
# lapply() applies the helper to each row index of valid_combos;
# with() evaluates the call using that row's columns.
cluster_solutions <- lapply(seq_len(nrow(valid_combos)), function(i) {
  with(valid_combos[i, ],
       hclust_k6(MSAnumeric, dist_method = dist, link_method = link))
})
names(cluster_solutions) <- apply(valid_combos, 1, paste, collapse = "_")

## --- (optional) silhouette widths per solution ---------------------------
sil_widths <- sapply(seq_along(cluster_solutions), function(i) {
  d <- dist(MSAnumeric, method = valid_combos$dist[i])
  mean(silhouette(cluster_solutions[[i]], d)[, "sil_width"])
})
sil_widths
We computed all combinations between distance metrics (euclidean, manhattan) and linkage methods (single, complete, ward.D2) to generate hierarchical clustering solutions. For the specific case of Ward’s linkage (ward.D2), we only used the Euclidean distance. This is because Ward’s method relies on minimizing the total within-cluster variance at each step of the merging process—a property that only holds under the Euclidean metric.
To evaluate the quality of each clustering solution, we computed the silhouette index for every observation. The silhouette value for an observation \(i\) is defined as
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \]
where:
\(a(i)\) is the average dissimilarity of \(i\) to all other points in the same cluster (cohesion),
\(b(i)\) is the minimum average dissimilarity of \(i\) to all points in any other cluster (separation).
The closer \(s(i)\) is to 1, the better the assignment. Values near 0 indicate uncertainty, and negative values suggest misclassification.
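The quantities \(a(i)\) and \(b(i)\) can be made concrete with a hand-rolled silhouette on a tiny example. In the analysis itself we used `cluster::silhouette()`; this sketch only spells out the definition.

```r
# Hand-rolled silhouette, to make a(i) (cohesion) and b(i) (separation)
# concrete. Two well-separated toy groups are used for illustration.
sil_manual <- function(x, labels) {
  d <- as.matrix(dist(x))
  sapply(seq_len(nrow(x)), function(i) {
    own <- labels == labels[i]
    a <- mean(d[i, own & seq_len(nrow(x)) != i])      # a(i): avg dist within own cluster
    b <- min(tapply(d[i, !own], labels[!own], mean))  # b(i): nearest other-cluster avg dist
    (b - a) / max(a, b)
  })
}

set.seed(3)
x <- rbind(matrix(rnorm(20, 0, 0.2), ncol = 2),
           matrix(rnorm(20, 5, 0.2), ncol = 2))
lab <- rep(1:2, each = 10)
mean(sil_manual(x, lab))  # close to 1 for well-separated groups
```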
❓ Which combinations make sense based on what we know about the method and the data?
Based on both methodological considerations and the structure of the data, the combinations that make the most sense are those using the Euclidean distance:
✅ Methodologically appropriate combinations:
euclidean + single
euclidean + complete
euclidean + ward.D2 ← Required for Ward’s method
🚫 Less appropriate combinations:
manhattan + single
manhattan + complete
While these are technically possible, they are less interpretable in this setting because:
The dataset consists of standardized, discrete survey responses (Likert scale), which are better suited to Euclidean distance.
Ward’s method explicitly relies on minimizing within-cluster variance, a property that only holds under the Euclidean metric. Using any other distance with ward.D2 would violate its assumptions.
📊 Comment on Silhouette Results
The silhouette score quantifies the quality of clustering by comparing intra-cluster cohesion to inter-cluster separation. A higher score indicates more meaningful and well-separated clusters.
| Distance  | Linkage  | Silhouette Score |
|-----------|----------|------------------|
| euclidean | single   | 0.2562 ← Highest |
| manhattan | single   | 0.2453           |
| euclidean | complete | 0.0896           |
| manhattan | complete | 0.1028           |
| euclidean | ward.D2  | 0.1191           |
Key observations:
The best-performing combination was euclidean + single with a silhouette score of 0.2562, though this is still a moderate value.
Both complete linkage variants performed poorly (scores around 0.09–0.10), indicating poor cluster separation.
ward.D2 with Euclidean produced only a modest silhouette score (0.1191), despite being the only valid pairing for that linkage method.
All scores are relatively low, suggesting the dataset lacks a strong natural clustering structure — a conclusion consistent with the grand tour visual analysis.
3. Dendrograms in 2D
The following code draws the dendrograms in 2D.
# 1. Determine a tidy layout
n_plots <- nrow(valid_combos)
n_cols <- 3  # up to 3 columns looks good
n_rows <- ceiling(n_plots / n_cols)
op <- par(mfrow = c(n_rows, n_cols), mar = c(4, 4, 3, 1))  # save old par

# 2. Loop over each combination and draw the dendrogram
for (i in seq_len(n_plots)) {
  dist_m <- valid_combos$dist[i]
  link_m <- valid_combos$link[i]

  # Compute distance matrix and hierarchical clustering
  d <- dist(MSAnumeric, method = dist_m)  # uses the numeric data already loaded
  hc <- hclust(d, method = link_m)

  # Plot dendrogram
  plot(
    hc,
    labels = FALSE,
    hang = -1,
    main = paste0("Distance: ", dist_m, "\nLinkage: ", link_m, " (k = 6)")
  )

  # Highlight the 6-cluster cut
  rect.hclust(hc, k = 6, border = 2:7)
}

# 3. Restore original graphics settings
par(op)
🔍 Visual Assessment of the Dendrograms
euclidean + single
The tree is extremely tall and narrow, with merges happening one point at a time.
This indicates weak group cohesion and a highly fragmented structure.
The final 6 clusters appear arbitrarily cut, with no meaningful group separation.
manhattan + single
Similar to the previous one, but slightly more compressed vertically.
Still shows a clear chaining effect, where clusters are formed by connecting distant points.
Cluster boundaries are visually unclear.
euclidean + complete
The dendrogram is more balanced, with earlier merges forming thicker branches.
The 6 final clusters are more compact and visibly distinct than in single linkage.
However, there’s still considerable overlap near the bottom.
manhattan + complete
Visually similar to the Euclidean version, but the height scale is more stretched.
Clusters are well formed, though not entirely separated.
The cuts produce six dense and somewhat symmetrical groups.
euclidean + ward.D2
The cleanest and most structured tree of all.
The dendrogram shows even spacing, short internal branches, and clear vertical separation.
The 6 clusters are visibly well-defined, compact, and evenly sized.
✅ Overall Impression
Only euclidean + ward.D2 shows clearly interpretable and compact groupings.
Both single linkage plots result in long, thin, noisy trees with no meaningful structure.
complete linkage performs visually better, especially with manhattan distance, but still lacks sharp separation.
Ward’s method offers the most visually coherent and trustworthy clustering.
4. Confusion Matrix
First, let's declare a function that calculates the confusion matrix for the combinations we are interested in.
library(dplyr)
library(tidyr)
library(gt)

# 1. Pair the clustering assignments
conf_mat <- function(method1, method2) {
  # Build a 6 x 6 contingency table
  mat <- table(
    factor(cluster_solutions[[method1]], levels = 1:6),
    factor(cluster_solutions[[method2]], levels = 1:6)
  )
  # Optional: add readable dim-names
  dimnames(mat) <- list(
    paste0(method1, "_c", 1:6),
    paste0(method2, "_c", 1:6)
  )
  return(mat)  # <-- really returns the matrix
}

mat1 <- conf_mat("manhattan_complete", "euclidean_ward.D2")
mat1
The confusion matrix reveals a substantial mismatch between the two clustering solutions. Only one cluster from the Manhattan + Complete solution (c2) shows a clean correspondence with a single Ward cluster (ward_c4). All other clusters are highly fragmented, with their members distributed across multiple clusters in the alternative solution.
Most notably:

- mc_c1, the largest cluster, is scattered across five different Ward clusters, especially ward_c2 (101) and ward_c3 (115).
- mc_c5 is nearly evenly split across ward_c3, ward_c5, and ward_c6, indicating strong disagreement.
- mc_c3 is also fragmented, while mc_c6 is small and scattered.
This suggests that Manhattan + Complete tends to form broader, less internally consistent clusters, while Euclidean + Ward.D2 prefers tighter, more compact partitions, possibly dividing heterogeneous groups identified by the former.
Further investigation — for example, using animate_xy() or detour() — can help visually diagnose how these disagreements arise and whether they correspond to outlier substructures in the data.
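Beyond visual inspection, the level of agreement between two partitions can be summarised with a single number, the (unadjusted) Rand index, computed straight from a contingency table. The labels below are toy examples; in the analysis you would pass the two `cutree()` vectors from `cluster_solutions`.

```r
# Unadjusted Rand index: the fraction of point-pairs on which two
# partitions agree (together in both, or apart in both).
rand_index <- function(a, b) {
  tab <- table(a, b)
  n <- length(a)
  comb2 <- function(m) m * (m - 1) / 2     # choose(m, 2), vectorised
  agree_both <- sum(comb2(tab))            # pairs together in both partitions
  same_a <- sum(comb2(rowSums(tab)))       # pairs together in partition a
  same_b <- sum(comb2(colSums(tab)))       # pairs together in partition b
  total <- comb2(n)
  (total + 2 * agree_both - same_a - same_b) / total
}

a <- c(1, 1, 1, 2, 2, 2)
rand_index(a, a)                    # identical partitions -> 1
rand_index(a, c(1, 1, 2, 2, 3, 3))  # partial agreement -> between 0 and 1
```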
library(tibble)
library(dplyr)
library(liminal)
library(detourr)
library(plotly)
library(crosstalk)
library(viridisLite)
library(bsplus)      # for bscols()
library(conflicted)  # in case it is not loaded automatically
conflicts_prefer(plotly::layout)  # tell conflicted to use plotly's layout()

# 1. Combine numeric data with cluster labels
dat <- as_tibble(MSAnumeric) %>%
  mutate(
    cl_mc = factor(cluster_solutions[["manhattan_complete"]]),
    cl_w  = factor(cluster_solutions[["euclidean_ward.D2"]]),
    cl_mc_j = jitter(as.numeric(cl_mc), amount = 0.25),  # jitter separates overlapping points
    cl_w_j  = jitter(as.numeric(cl_w),  amount = 0.25)
  )

# 2. SharedData for linked brushing
dat_shared <- SharedData$new(dat)

# 3. Grand-tour plot (tell detour to use *only* numeric columns);
#    this creates a classic tour that can be played, colouring each
#    point by its cluster assignment.
set.seed(123)
detour_plot <- detour(
  dat_shared,
  tour_aes(
    projection = all_of(colnames(MSAnumeric)),  # only real numeric vars
    colour = cl_mc                              # cluster colouring
  )
) %>%
  tour_path(grand_tour(2), max_bases = 100, fps = 60) %>%
  show_scatter(
    alpha = 0.7,
    axes = FALSE,
    width = "100%",
    height = "450px",
    background = "white",              # light background
    palette = viridisLite::viridis(6)  # bright, well-separated colours
  )

# 4. Jittered confusion scatter
conf_plot <- plot_ly(
  dat_shared,
  x = ~cl_mc_j,
  y = ~cl_w_j,
  color = ~cl_mc,
  colors = viridis(6, option = "D"),
  type = "scatter",
  mode = "markers",
  marker = list(size = 6),
  height = 450
) %>%
  layout(
    xaxis = list(title = "Manhattan-Complete clusters"),
    yaxis = list(title = "Euclidean-Ward.D2 clusters"),
    showlegend = FALSE
  ) %>%
  highlight(on = "plotly_selected", off = "plotly_doubleclick")

# 5. Display side-by-side
bscols(detour_plot, conf_plot, widths = c(6, 6))
5. Results using K-means
library(tibble)
library(dplyr)
library(detourr)
library(viridisLite)
library(crosstalk)
library(htmltools)  # provides div() and tags

set.seed(123)  # reproducible
km6 <- kmeans(MSAnumeric, centers = 6, nstart = 50)

pts <- as_tibble(MSAnumeric) %>%
  mutate(
    km6 = factor(km6$cluster),  # k-means labels
    kind = "point"              # will become the glyph/shape
  )

centres <- as_tibble(km6$centers) %>%
  mutate(
    km6 = factor(1:6),  # labels 1-6 for the six centroids
    kind = "centre"
  )

tour_df <- bind_rows(pts, centres)

# ── 3. SharedData for brushing (centres always remain visible) ──
tour_shared <- SharedData$new(tour_df)

# ── 4. Grand tour: data points + cluster means ──────────────────
set.seed(321)
tour_km <- detour(
  tour_shared,
  tour_aes(
    projection = all_of(colnames(MSAnumeric)),  # all numeric vars
    colour = km6,  # colour by k-means cluster
    glyph = kind   # different glyph for points vs centres
  )
) %>%
  tour_path(grand_tour(2), max_bases = 100, fps = 60) %>%
  show_scatter(
    alpha = 0.5,     # lighter points → less over-plotting
    point_size = 6,  # centres draw larger automatically
    axes = FALSE,
    width = "100%",
    height = "500px",
    palette = viridisLite::viridis(6)
  )

# ── 5. Display ──────────────────────────────────────────────────
tour_km
Variation Within the Clusters
From the grand tour visualization, we can observe that the clusters formed by k-means appear highly overlapping and compacted in the center of the projection space. There is little visual separation between the groups, and no clear boundaries are visible during the animation.
This suggests that:
The within-cluster variation is low — most points remain near their respective centroids.
However, the between-cluster variation is also low — clusters are not well-separated from each other.
The k-means solution might be struggling to find distinct group structures in this high-dimensional space.
In short, the clusters are internally consistent but not clearly distinguishable, which could indicate that the data is either inherently continuous, not well-clustered, or that 6 clusters may not be the optimal choice.
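The within/between-variation claims above can be checked numerically: a `kmeans` object decomposes total variation into `tot.withinss` and `betweenss`. The toy data below stands in for `MSAnumeric`; on the real data you would inspect `km6` directly.

```r
# Decomposing variation for a k-means solution: total sum of squares
# equals within-cluster plus between-cluster sums of squares.
set.seed(8)
X <- matrix(rnorm(300), ncol = 3)       # toy data standing in for MSAnumeric
km <- kmeans(X, centers = 6, nstart = 25)

km$tot.withinss                          # within-cluster variation
km$betweenss                             # between-cluster variation
km$betweenss / km$totss                  # fraction of variance "explained" by clusters
```

A low between/total ratio is exactly the symptom described above: internally consistent clusters that are nonetheless not well separated from each other.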
Matching Clusters with Relevant Variables
Following the movement of the cluster centroids during the grand tour, it is difficult to clearly match individual clusters to specific variables.
While the centroids do move slightly in different directions as the projection changes, their movement remains mostly confined to a central region, and there is significant overlap between clusters throughout the tour. This suggests that:
No single variable (or small group of variables) dominates the separation of the clusters.
The clusters likely reflect weak, multidimensional variation rather than clear distinctions along individual variables.
The grand tour does not reveal strong, interpretable axes along which particular clusters are clearly separated from others.
To better understand which variables (if any) drive cluster separation, we would need to:

- Use projection pursuit or variable loadings from PCA, or
- Examine the mean values of variables across clusters numerically.
In summary, based on the grand tour alone, we cannot confidently associate specific clusters with specific variables, indicating that the clustering structure is either weak or driven by a complex combination of many variables.
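The numerical check suggested above — per-cluster means of each variable — takes one line with `aggregate()`. Toy data is used here; on the real data you would pass `as.data.frame(MSAnumeric)` and `km6$cluster`.

```r
# Per-cluster variable means: near-identical rows for a variable mean
# that variable does not separate the clusters.
set.seed(5)
X <- data.frame(v1 = rnorm(60), v2 = rnorm(60))      # toy stand-in
cl <- kmeans(scale(X), centers = 2, nstart = 10)$cluster

profile <- aggregate(X, by = list(cluster = cl), FUN = mean)
profile
```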
6. Relationship Between Clusters and Risk Types
# ─────────────────────────────────────────────────────────────
# Projection-pursuit guided tour
#  • maximises separation of the six k-means clusters
#  • lets us see which linear combinations of the risk variables
#    best distinguish the groups
# ─────────────────────────────────────────────────────────────
library(tourr)  # lda_pp() and guided_tour()
library(detourr)
library(viridisLite)
library(crosstalk)

## 1. Build an LDA-style projection-pursuit index
pp_index <- lda_pp(km6$cluster)  # labels = k-means clusters

## 2. Use only the original, scaled points (no centroids needed here)
pp_shared <- SharedData$new(pts)

## 3. Run the guided tour
set.seed(2025)
tour_pp <- detour(
  pp_shared,
  tour_aes(
    projection = all_of(colnames(MSAnumeric)),  # every numeric risk variable
    colour = km6                                # colour by k-means cluster
  )
) %>%
  tour_path(guided_tour(pp_index), max_bases = 40, fps = 60) %>%
  show_scatter(
    alpha = 0.6,  # semi-transparent points
    axes = TRUE,  # keep axes so loadings are visible
    width = "100%",
    height = "450px",
    background = "white",
    palette = viridisLite::viridis(6)
  )

# display the interactive guided tour
tour_pp
What we just did & why it works
1. Recap of the workflow
Scaled the numeric risk variables (MSAnumeric) so every dimension contributes on the same scale.
Ran k-means (k = 6) to assign each metro area to one of six clusters (purely data-driven, no labels yet).
Fed those cluster labels into a projection-pursuit guided tour:
A tour is an animated sequence of 2-D projections through high-dimensional space.
Instead of wandering randomly, a guided tour looks for directions that optimise a chosen index (here, LDA).
2. What is LDA in this context?
LDA = Linear Discriminant Analysis.
- Classical LDA finds linear combinations of the variables that maximise the ratio of “between-group” to “within-group” variance.
- Mathematically, it solves the generalised eigen-problem
  $$\Sigma_W^{-1}\,\Sigma_B\,\mathbf{v} = \lambda\,\mathbf{v},$$
  where \(\Sigma_W\) is the pooled within-group covariance matrix and \(\Sigma_B\) the between-group covariance matrix.
- The eigenvectors \(\mathbf{v}\) give axes that separate the groups (here, the k-means clusters) as cleanly as possible.
In tourr, lda_pp() turns that optimisation criterion into a projection-pursuit index: for any 2-D projection, the index value measures how well that plane discriminates the clusters.
The guided tour uses gradient ascent to slide the projection until that index is (locally) maximised.
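To make that index concrete, here is a minimal sketch of an LDA-style projection-pursuit index based on Wilks' Λ (an illustration, not tourr's exact implementation): given an n × 2 projection `Y` and group labels `g`, it returns a value in [0, 1] that grows as the groups separate.

```r
lda_index <- function(Y, g) {
  g  <- as.factor(g)
  mu <- colMeans(Y)
  B <- W <- matrix(0, ncol(Y), ncol(Y))
  for (lev in levels(g)) {
    Yk <- Y[g == lev, , drop = FALSE]
    d  <- colMeans(Yk) - mu
    B  <- B + nrow(Yk) * tcrossprod(d)            # between-group scatter
    W  <- W + crossprod(scale(Yk, scale = FALSE)) # within-group scatter
  }
  1 - det(W) / det(W + B)  # in [0, 1]; larger = better group separation
}

# Toy check: a projection separating two shifted groups scores higher
set.seed(1)
g <- rep(1:2, each = 50)
Y_sep   <- rbind(matrix(rnorm(100), ncol = 2),
                 matrix(rnorm(100, mean = 3), ncol = 2))
Y_noise <- matrix(rnorm(200), ncol = 2)
sep_val   <- lda_index(Y_sep, g)    # well-separated groups
noise_val <- lda_index(Y_noise, g)  # no structure
c(sep_val, noise_val)
```

The guided tour evaluates an index of this kind at each candidate projection and moves toward projections with higher values.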
3. How the guided tour answers the question
Task: “Use a projection-pursuit guided tour to best separate the k-means clusters. How are the clusters related to the different types of risk?”
Steps we took:
| Step | Code element | Purpose |
|------|--------------|---------|
| 1 | `lda_pp(km6$cluster)` | Build an LDA index using the k-means labels. |
| 2 | `guided_tour(pp_index)` | Tell the tour to maximise that index. |
| 3 | `detour()` + `show_scatter()` | Visualise the resulting projection in real time. |
| 4 | Inspect arrow loadings | Arrows show each risk variable's weight in the optimal 2-D plane, letting us see which risks separate which clusters. |
Because the tour animates successive bases on the optimisation path, we can literally watch the centroids pull apart until the view that maximises between-cluster variance appears. The arrow lengths and directions in that final frame tell us:
Social risk is the dominant axis (longest arrow), splitting the yellow vs. blue clusters.
Financial risk is the next key dimension, isolating the blue cluster.
Health risk tilts the green cluster away from the centre.
The remaining risk types (Recreational, Career, Safety) have minor roles; their arrows stay short or only emerge in later frames.
Hence we linked which clusters correspond to which dominant risk profiles—all derived objectively from the guided tour using LDA as the discriminating criterion.
Projection-pursuit guided tour: what drives the k-means clusters?
The LDA-guided tour looked for a linear projection that maximises the between-cluster separation of the six k-means groups.
In that optimal view the six risk dimensions appear as black arrows; the longer (and more stable) the arrow, the more strongly that variable contributes to separating the clusters.
| Risk dimension (arrow) | Contribution in the optimal view | Cluster pattern it explains |
|------------------------|----------------------------------|-----------------------------|
| Social | Strongest: arrow is long and stable | Splits the yellow cluster (high Social) from the blue cluster (low Social). |
| Financial | Second strongest | Helps separate the blue cluster (high Financial) from the rest. |
| Health | Moderate | Pulls the green cluster downward (higher Health risk). |
| Recreational | Weak | Fine-tunes the position of central clusters. |
| Career | Very weak | Minor tweaks within the dense centre. |
| Safety | Almost zero in the first projection; appears only in later bases | Explains a slight shift of the dark-purple cluster in later frames. |
Cluster-by-cluster interpretation
Yellow cluster
Largest positive loading on Social → members show the highest social-risk scores, average on all other risks.
Blue cluster
Negative on Social but strongly positive on Financial → low social risk, high financial risk.
Green cluster
Pulled downward along Health → greater health-related risk than the other groups.
Turquoise & purple clusters
Remain near the origin; arrows for Recreational and Career give them a slight tilt, so they represent moderate, mixed risk profiles rather than an extreme in any one dimension.
Dark-purple cluster
Becomes distinguishable only when the tour rotates toward the Safety dimension, indicating somewhat elevated safety risk while staying ordinary on the major axes.
Take-away
The guided tour shows that the six k-means clusters are primarily organised by three risk dimensions – Social, Financial, and Health. The remaining risk types (Recreational, Career, Safety) contribute little additional separation and mainly fine-tune the positions of clusters that are otherwise tightly packed in the centre.
7. Comparing the k-means solution with the hierarchical (Manhattan–Complete) solution
```r
# ─────────────────────────────────────────────────────────────
# EXTRA CHUNK ─ Confusion matrix: k-means (k = 6) vs. your
# selected hierarchical solution
# ─────────────────────────────────────────────────────────────
# 1. Pick the hierarchical solution you want to compare.
#    (change the name below if you prefer another linkage/distance pair)
hier_method <- "manhattan_complete"   # <- chosen hierarchical combo

# 2. Extract the cluster labels
cl_hier <- cluster_solutions[[hier_method]]  # hierarchical labels (1–6)
cl_km   <- km6$cluster                       # k-means labels (1–6)

# 3. Build the 6×6 confusion matrix
conf_km_vs_hier <- table(
  Hierarchical = factor(cl_hier, levels = 1:6),
  Kmeans       = factor(cl_km,   levels = 1:6)
)

# 4. Print the raw counts
conf_km_vs_hier
```
Comparison of k-means vs Hierarchical Clustering (Manhattan–Complete)
The 6 × 6 table below cross-classifies every observation by its hierarchical‐cluster label (rows) and its k-means label (columns):
| Hierarchical ↓ / k-means → | 1 | 2 | 3 | 4 | 5 | 6 |
|----------------------------|---|---|---|---|---|---|
| 1 | 40 | 6 | 1 | 4 | 3 | 0 |
| 2 | 2 | 0 | 2 | 0 | 0 | 101 |
| 3 | 18 | 4 | 100 | 0 | 2 | 47 |
| 4 | 4 | 0 | 0 | 29 | 0 | 0 |
| 5 | 2 | 68 | 26 | 0 | 10 | 0 |
| 6 | 18 | 13 | 4 | 6 | 52 | 1 |
Quick summary of the overlap
| Hierarchical cluster | Dominant k-means cluster | Share of that row | Comment |
|----------------------|--------------------------|-------------------|---------|
| 1 | 1 (40/54) | 74 % | Mostly overlaps with k-means cluster 1, but with some fragmentation. |
| 2 | 6 (101/105) | 96 % | Very strong one-to-one match with cluster 6. |
| 3 | 3 (100/171) | 58 % | Majority in k-means cluster 3, but with spillover into clusters 1 and 6. |
| 4 | 4 (29/33) | 88 % | Near-perfect alignment with cluster 4 (the row total is 33, since 4 observations fall in k-means cluster 1). |
| 5 | 2 (68/106) | 64 % | Dominated by cluster 2 but mixed with others. |
| 6 | 5 (52/94) | 55 % | Moderately clean overlap with cluster 5. |
Are the groupings mostly similar?
Some strong overlaps. Hierarchical clusters 2 and 4 show very clean alignment with k-means clusters 6 and 4, respectively.
Moderate to high fragmentation elsewhere. Cluster 3 contains a majority from k-means cluster 3, but also draws from clusters 1 and 6. Clusters 5 and 6 show more diffuse membership.
Overall match ≈ 69 %. Counting only the dominant cell in each row of the confusion matrix gives 390 of the 563 observations.
In summary, k-means and hierarchical clustering with complete linkage on Manhattan distance identify some common structure, but produce notably different groupings, especially for more ambiguous regions of the data.
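The row-wise dominant-cell tally used above can be computed directly from the cross-tabulation; a base-R sketch with a toy example (cluster labels are arbitrary, so clusters must be matched before comparing):

```r
# Cross-tabulate the two labelings, then sum the largest count in each row.
best_match_agreement <- function(tab) sum(apply(tab, 1, max)) / sum(tab)

# Toy example: two 3-cluster labelings of nine points
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
b <- c(1, 1, 2, 2, 2, 2, 3, 3, 1)
agree <- best_match_agreement(table(a, b))  # 7 of 9 points in dominant cells
agree
```

A label-permutation-invariant alternative is the adjusted Rand index, e.g. `mclust::adjustedRandIndex()`.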
```
           Pred
True        Indonesia Australia
  Indonesia       319       164
  Australia       142       428
```
Comparison of Math Performance: Australia vs Indonesia (PISA Sample)
We analyzed a stratified 10% sample from the PISA dataset, comparing five plausible values (PV1MATH–PV5MATH) of mathematics scores between Australia and Indonesia. The goal was to assess:
Whether there are statistically significant differences between the countries,
The size and direction of those differences,
Whether the countries are separable based on math scores using linear classification.
Summary Statistics
For each plausible value, the average score in Australia was substantially higher than in Indonesia:
| Country | PV | Mean Score | SD | N |
|---------|----|-----------|----|---|
| Indonesia | PV1MATH | 404.1 | 89.4 | 483 |
| Australia | PV1MATH | 492.9 | 92.1 | 570 |
| Indonesia | PV2MATH | 403.4 | 86.9 | 483 |
| Australia | PV2MATH | 494.2 | 89.4 | 570 |
| Indonesia | PV3MATH | 401.3 | 90.3 | 483 |
| Australia | PV3MATH | 492.6 | 90.5 | 570 |
| Indonesia | PV4MATH | 402.6 | 88.8 | 483 |
| Australia | PV4MATH | 493.0 | 93.3 | 570 |
| Indonesia | PV5MATH | 399.2 | 90.2 | 483 |
| Australia | PV5MATH | 495.6 | 91.2 | 570 |
T-tests and Effect Sizes
All five tests returned very strong evidence of a difference. (Cohen's d is negative because it is computed as Indonesia minus Australia, while the mean difference is reported as a magnitude.)

| PV | Mean Difference | p-value | Cohen's d | Magnitude |
|----|-----------------|---------|-----------|-----------|
| PV1MATH | 89.7 | < 1e-50 | -0.99 | Large |
| PV2MATH | 90.8 | < 1e-50 | -1.03 | Large |
| PV3MATH | 91.3 | < 1e-50 | -1.01 | Large |
| PV4MATH | 90.4 | < 1e-50 | -0.99 | Large |
| PV5MATH | 96.4 | < 1e-50 | -1.06 | Large |
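Each row of the table combines a Welch t-test with a pooled-SD Cohen's d; here is a self-contained sketch on simulated scores that roughly match the means and SDs reported above (not the actual PISA sample):

```r
# Pooled-standard-deviation Cohen's d for two score vectors
cohens_d <- function(x, y) {
  sp <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
               (length(x) + length(y) - 2))  # pooled standard deviation
  (mean(x) - mean(y)) / sp
}

set.seed(1)  # toy stand-in roughly matching the summary statistics above
idn <- rnorm(483, mean = 404, sd = 89)
aus <- rnorm(570, mean = 493, sd = 92)
p_val <- t.test(idn, aus)$p.value  # Welch two-sample t-test
d_val <- cohens_d(idn, aus)        # negative: Indonesia minus Australia
c(p_val, d_val)
```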
Visual Inspection
The density plot of PV1MATH shows a clear shift: Australia’s distribution is centered nearly 90 points to the right of Indonesia’s, consistent with the t-tests.
Separability via LDA
A Linear Discriminant Analysis model trained on all five plausible values:
10-fold cross-validated accuracy: 70.5%
Confusion matrix (full sample): shown above; its in-sample accuracy is (319 + 428)/1053 ≈ 71 %, consistent with the cross-validated figure.
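A cross-validated LDA accuracy of this kind can be reproduced in miniature with `MASS::lda` on toy data (leave-one-out CV via `CV = TRUE`, rather than the 10-fold scheme used above):

```r
library(MASS)

set.seed(1)
d <- data.frame(
  pv      = c(rnorm(100, 404, 89), rnorm(100, 493, 92)),  # toy scores
  country = rep(c("IDN", "AUS"), each = 100)
)
fit <- lda(country ~ pv, data = d, CV = TRUE)  # CV = TRUE: leave-one-out predictions
acc <- mean(fit$class == d$country)            # cross-validated accuracy
acc
```

With a standardized difference near 1, an accuracy around 0.7 is expected, which matches the separability observed for the real sample.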
11. Chapter 16, question 1, 2, 3
Solution
Question 1.
```r
# ================================================================
# Linear SVM vs LDA : Bushfire data (arson vs lightning)
# ================================================================
# Packages --------------------------------------------------------
if (!requireNamespace("mulgar",  quietly = TRUE)) install.packages("mulgar")
if (!requireNamespace("cowplot", quietly = TRUE)) install.packages("cowplot")
library(mulgar)   # bushfire data
library(dplyr)
library(e1071)    # SVM
library(MASS)     # LDA
library(ggplot2)
library(cowplot)

# Load data -------------------------------------------------------
bushfire <- mulgar::bushfires   # NOTE: object name is 'bushfires'

# Keep only arson / lightning and the needed predictors -----------
bf <- bushfire %>%
  filter(tolower(cause) %in% c("arson", "lightning")) %>%
  dplyr::select(log_dist_cfa, log_dist_road, cause) %>%
  mutate(cause = factor(tolower(cause))) %>%
  na.omit()

# Fit linear SVM --------------------------------------------------
svm_fit <- svm(
  cause ~ log_dist_cfa + log_dist_road,
  data   = bf,
  kernel = "linear",
  scale  = TRUE
)

# Fit LDA ---------------------------------------------------------
lda_fit <- lda(cause ~ log_dist_cfa + log_dist_road, data = bf)

# Prediction grid -------------------------------------------------
make_grid <- function(df, n = 300) {
  expand.grid(
    log_dist_cfa  = seq(min(df$log_dist_cfa),  max(df$log_dist_cfa),  length.out = n),
    log_dist_road = seq(min(df$log_dist_road), max(df$log_dist_road), length.out = n)
  )
}
grid <- make_grid(bf)
grid$svm_z <- as.numeric(predict(svm_fit, grid) == "arson")
grid$lda_z <- as.numeric(predict(lda_fit, grid)$class == "arson")

# SVM plot --------------------------------------------------------
p_svm <- ggplot(bf, aes(log_dist_cfa, log_dist_road, colour = cause)) +
  geom_point(size = 2, alpha = 0.65) +
  geom_contour(data = grid, aes(z = svm_z), breaks = 0.5, colour = "black") +
  geom_point(data = bf[svm_fit$index, ],
             aes(log_dist_cfa, log_dist_road),
             shape = 4, size = 3, stroke = 1.2, colour = "black") +
  labs(title    = "Linear SVM decision boundary",
       subtitle = "Support vectors are shown with x") +
  theme_minimal() +
  theme(legend.position = "bottom")

# LDA plot --------------------------------------------------------
p_lda <- ggplot(bf, aes(log_dist_cfa, log_dist_road, colour = cause)) +
  geom_point(size = 2, alpha = 0.65) +
  geom_contour(data = grid, aes(z = lda_z), breaks = 0.5, colour = "black") +
  labs(title = "LDA decision boundary") +
  theme_minimal() +
  theme(legend.position = "bottom")

# Show both -------------------------------------------------------
cowplot::plot_grid(p_svm, p_lda, labels = c("A", "B"))
```
🔥 SVM vs LDA on Bushfire Causes: Arson vs Lightning
🔍 Task
We generate a subset of the bushfire dataset, retaining only the variables:
log_dist_cfa (log-distance to fire authority),
log_dist_road (log-distance to nearest road), and
cause (fire cause),
and filter to keep only the observations where the cause is either “lightning” or “arson”. We then fit:
A linear Support Vector Machine (SVM) model to classify the two causes, and
A Linear Discriminant Analysis (LDA) model for comparison.
We visualize and compare both classification boundaries.
📈 Results
Panel A (left) shows the linear SVM decision boundary with support vectors marked as ×.
Panel B (right) shows the LDA decision boundary.
📊 Interpretation and Comparison
| Feature | SVM | LDA |
|---------|-----|-----|
| Decision boundary | Placed to maximize the margin between classes, relying only on support vectors (boundary points). | Based on group means and pooled covariance, assuming Gaussian class distributions. |
| Behavior in this data | Positions the boundary cleanly within the gap between classes, aligned with the sparsest region of overlap. | Boundary is slightly shifted, reflecting class imbalance and covariance structure. |
| Robustness to outliers | More robust: focuses only on a subset of the most critical observations. | Sensitive to distribution shape and unequal variances. |
| Visual conclusion | Boundary more intuitively matches the visual separation seen in the plot. | Boundary cuts through denser regions, potentially misclassifying borderline observations. |
✅ Conclusion
Both SVM and LDA attempt to separate fires caused by lightning vs arson using spatial proximity variables. However, SVM achieves a boundary that aligns better with the natural gap between clusters in the data, due to its reliance on margin maximization and support vectors. In contrast, LDA relies on distributional assumptions, which may result in less optimal boundaries when variances or densities differ between classes.
Question 2.
```r
# 1. Prepare the data ------------------------------------------
bf4 <- mulgar::bushfires %>%   # NOTE: object name is 'bushfires'
  filter(tolower(cause) %in% c("arson", "lightning")) %>%
  dplyr::select(log_dist_cfa, log_dist_road, amaxt180, amaxt720, cause) %>%
  mutate(cause = factor(tolower(cause))) %>%  # standardise labels
  na.omit()                                   # drop rows with any NAs

# 2. Fit **linear** SVM ----------------------------------------
svm4 <- svm(
  cause ~ .,
  data   = bf4,
  kernel = "linear",
  scale  = TRUE   # standardise predictors -> fair coefficient comparison
)

# 3. Extract the hyper-plane coefficients ----------------------
# w = Σ_i α_i * y_i * x_i   (only support vectors contribute)
w <- as.vector(t(svm4$SV) %*% svm4$coefs)
names(w) <- colnames(bf4)[1:4]   # attach variable names

# Intercept (note: sign convention matches e1071 docs)
b <- -svm4$rho

# 4. Rank variables by |weight| --------------------------------
importance <- sort(abs(w), decreasing = TRUE)
importance

barplot(importance,
        main = "Variable importance via |SVM weight|",
        ylab = "|w| (higher = more important)",
        las  = 2)
```
🔥 Multivariate Linear SVM: Identifying Important Predictors of Fire Cause
🔍 Task
We extended the previous classification of bushfire causes by fitting a linear SVM model using four predictors:
log_dist_cfa (log-distance to fire authority),
log_dist_road (log-distance to nearest road),
amaxt180 (average max temperature over the past 180 days), and
amaxt720 (average max temperature over the past 720 days).
Our goal was to compute the separating hyperplane and evaluate variable importance based on the learned weights.
⚙️ Method
We standardized all predictors and fit a linear SVM to classify observations as either arson or lightning. In a linear SVM, the hyperplane coefficients (weights) represent the importance of each variable in separating the classes when data are scaled.
We extracted the weight vector \(\mathbf{w}\) from the SVM solution,
$$\mathbf{w} = \sum_i \alpha_i\, y_i\, \mathbf{x}_i,$$
and ranked predictors by \(|w_j|\).
📈 Results
The bar plot above shows the absolute value of the SVM weights for each predictor. Higher values imply greater importance in the separating decision function.
🧠 Interpretation
amaxt180 has the highest weight, suggesting it is the most important predictor for distinguishing arson from lightning fires.
amaxt720 and log_dist_cfa follow closely, indicating that both long-term temperature and proximity to firefighting authorities contribute meaningfully.
log_dist_road had the smallest weight, suggesting it is least informative among the four variables used.
This result suggests that temperature history, especially over the last 180 days, is a critical factor in separating arson from lightning fire causes — potentially due to patterns of human activity versus natural fire conditions.
✅ Conclusion
By extending the SVM with additional predictors, we gain insight into which features are most relevant for distinguishing causes of bushfires. This approach highlights the strength of linear SVMs in feature weighting, especially when data are standardized and the separating hyperplane is interpretable.
References
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016
GAI Use
The use of AI is reported via the links inserted within each section as it was developed. Additional queries were made (not reported here), but these were aimed solely at improving the writing of already-developed sections, such as their coherence and formatting.
📊 Comment on Silhouette Results
The silhouette score quantifies the quality of clustering by comparing intra-cluster cohesion to inter-cluster separation. A higher score indicates more meaningful and well-separated clusters.
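A minimal sketch of how such a score is computed, using `cluster::silhouette()` on simulated data (not the assignment's dataset):

```r
library(cluster)  # silhouette()

set.seed(1)
X   <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # cluster A
             matrix(rnorm(100, mean = 4), ncol = 2))  # cluster B
km  <- kmeans(X, centers = 2, nstart = 10)
sil <- silhouette(km$cluster, dist(X))  # per-point silhouette widths
avg_sil <- mean(sil[, "sil_width"])     # overall score for the solution
avg_sil
```

For well-separated clusters like these the average width is high; values near 0 signal overlapping clusters, and negative widths mark likely misassigned points.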